Document Zone Content Classification Using Decision Tree and HMM
Abstract
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. This paper describes an algorithm that classifies each given document zone into one of nine classes. Both foreground and background features are studied. In one set, we used an optimized binary decision tree to estimate the maximum zone content class probability, while in the other set we used the Viterbi algorithm to find the optimal solution for a zone sequence. The training, pruning, and testing data sets for the algorithm include images drawn from the UWCDROM III document image database. The classifier assigns each zone of a scientific or technical document to one of the nine classes: two text classes (of font size pt and font size pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm accuracy is with a mean false alarm rate of .
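The abstract does not give the HMM formulation, so the following is only a minimal sketch of how Viterbi decoding can pick a class sequence for consecutive zones. The two class names, the probabilities, and the function name `viterbi` are illustrative assumptions, not values or code from the paper; the per-zone scores stand in for whatever a zone classifier would output.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return the most probable class sequence and its log-probability.

    log_init[k]     : log prior of class k for the first zone
    log_trans[j][k] : log probability that class k follows class j
    log_emit[t][k]  : log score of class k for zone t
    """
    n_classes = len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(n_classes)]
    back = []  # back[t-1][k] = best predecessor of class k at zone t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for k in range(n_classes):
            best_j = max(range(n_classes),
                         key=lambda j: score[j] + log_trans[j][k])
            new_score.append(score[best_j] + log_trans[best_j][k] + log_emit[t][k])
            pointers.append(best_j)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final class.
    last = max(range(n_classes), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

def logs(rows):
    """Convert a nested list of probabilities to log-probabilities."""
    return [[math.log(p) for p in row] for row in rows]

# Hypothetical toy setup: two classes and three consecutive zones.
classes = ["text", "math"]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = logs([[0.9, 0.1], [0.1, 0.9]])               # zones tend to keep their class
log_emit = logs([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]])    # per-zone class scores

path, log_prob = viterbi(log_init, log_trans, log_emit)
labels = [classes[k] for k in path]
per_zone = [max(range(2), key=lambda k: row[k]) for row in log_emit]
```

In this toy case the per-zone argmax (`per_zone`) labels the middle zone as math, while the decoded sequence (`labels`) keeps it as text because the assumed transition probabilities favor neighboring zones sharing a class. That is the kind of context effect sequence decoding is meant to capture.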
Similar resources
A Method for Document Zone Content Classification
This paper describes an algorithm to classify each given document zone into one of nine classes and provides a protocol for its performance evaluation. The classification scheme uses an optimized binary decision tree and Viterbi algorithm for HMM to find the optimal solution. Our algorithm was trained and tested on a total of 24,177 zones within the 1600 images from UWCDROM III database. Its ac...
A Study on the Document Zone Content Classification Problem
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for the determination of zone type of a given zone within an input doc...
Zone classification in a document using the method of feature vector generation
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. This paper describes an algorithm to classify each given document zone into one of nine different classes. Features for each zone such as run length mean and variance, spatial mean and variance, fraction of the total number of black pixels in the zone, and the zone width ratio f...
Document zone content classification and its performance evaluation
This paper describes an algorithm for determining the zone content type of a given zone within a document image. We take a statistics-based approach and represent each zone with a 25-dimensional feature vector. An optimized decision tree classifier is used to classify each zone into one of nine zone content classes. A performance evaluation protocol is proposed. The training and t...
Improvement of Zone Content Classification by Using Background Analysis
This paper presents an improved zone content classification method. Motivated by our novel background-analysis-based table identification research, we added two new features to the feature vector from one previously published method [7]. The new features are the total area of large horizontal and large vertical blank blocks and the number of text glyphs in the zone. A binary decision tree is us...
Publication date: 2002